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Abstract 


We  prove  a number  of  new  properties  of  algorithms  of  the 
Conjugate  Gradient  type,  paying  particular  attention  to  methods 
which  utilize  variable  metric  information  in  determining  the  conjugate 
gradient  search  directions.  We  attempt  a comprehensive  discussion 
of  conjugate  gradient  methods,  and  present  each  algorithm  within 
the  context  of  other  existing  algorithms,  an  approach  which  provides 
fresh  insights  and  some  new  algorithms. 


A STUDY  OF  CONJUGATE  GRADIENT  METHODS 


L.  Nazareth  and  J.  Nocedal 

1.  Introduction 

In  196^  Fletcher  and  Reeves  [1]  showed  how  the  conjugate 
gradient  method  of  Hestenes  and  Stiefel  [2]  for  solving  systems  of  linear 
equations  could  be  extended  and  used  to  find  local  minima  of  non-linear 
functions.  Since  then  many  variants  of  this  algorithm  have  appeared 
in  the  literature.  Methods  belonging  to  the  conjugate  gradient  family 
are  particularly  valuable  when  the  number  of  variables  is  large. 

In  this  paper  we  prove  a number  of  new  properties  of  conjugate 
gradient  type  algorithms,  paying  particular  attention  to  methods  which 
utilize  variable  metric  information  in  determining  the  conjugate  gradient 
steps.  We  attempt  a con£>  rehens  ive  discussion  of  conjugate  gradient 
methods,  and  present  each  algorithm  within  the  context  of  existing 
algorithms,  an  approach  which  yields  dividends  by  providing  fresh 
insights  and  some  new  algorithms.  Our  concern  is  with  conjugate  gradient 
methods  for  non-linear  unconstrained  optimization.  Extensive  work  on 
conjugate  gradient  methods  for  linear  systems,  e.g.  Concus,  Golub  and 
O'Leary  [3]  is  not  discussed  here,  though  we  believe  our  work  also 
has  applications  in  this  area. 
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Our  paper  is  organized  as  follows: 

Basic  conjugate  gradient  methods  are  discussed  in  Section  2. 
An  important  generalization  of  conjugate  gradient  methods  in 


which  the  metric  is  varied,  are  discussed  in  Section  3. 

In  Section  4 we  discuss  conjugate  gradient  methods  which  relax 
the  requirement  that  line  searches  be  exact. 

Finally  in  Section  5,  we  look  at  extensions  of  algorithms  in 
Section  4 to  the  case  when  the  metric  is  varied. 

Within  each  of  the  above  sections,  we  discuss  one  or  more  of 
the  following: 

a)  basics — algorithms,  properties  and  interpretations 

b)  generalizations — in  particular  to  arbitrary  metrics  and  to  arbitrary 
starting  directions 

c)  strategies — scaling,  and  when  and  how  often  to  restart. 

Subsequent  parts  of  this  research  deal  with  convergence  analysis  [4]  and 
we  hope  eventually  to  develop  a documented  and  distributable  FORTRAN 
implementation. 
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Conjugate  Gradient  Methods 


We  summarize  in  this  section  some  known  results  about  the 
conjugate  gradient  method.  We  are  concerned  with  the  problems  of  finding 
a local  minima  of  a function  f(x),  x € ]Rn.  We  denote  this  gradient 
of  f at  **  *y  gU*)  or  gk.  Let  yk  = gk+1  - 8*  and  *k  = Vl ' V 

2.1.  Basics 

2.1.1.  Algorithms 

The  conjugate  gradient  method,  originally  developed  by  Hestenes 
and  Stiefel  [2]  to  solve  systems  of  linear  equations,  was  adapted  to  the 
non-linear  unconstrained  optimization  problem  by  Fletcher  and  Reeves  [1] 
in  the  following  way: 

Given  xQ,  let  dQ  = -g(xQ). 

For  k = 1,  2,  ...  , let 

dk  = "8k  + W-l 

where  pk  = II«kll2/llek_1 1|2  and  ||-||  is  the 

Vi = *k + w 

where  is  chosen  to  minimize  f along 

This  algorithm  is  restarted  every  n 


(2.1) 

Euclidean  norm. 

(2.2) 

V 

or  (n+l)  iteration. 
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When  applied  to  the  quadratic  function  *(x)  = - (x-x*)  A(x-x*), 

x G ®n  , when  A is  a positive  definite  and  symmetric  matrix,  each  of 
the  above  three  algorithms  has  the  following  properties  (i)  finite 
termination  in,  at  most,  n steps;  (ii)  g^g^  =0,  Vi  > j,  and 
(iii)  d^  =0,  i / j (i.e.  directions  are  conjugate).  As  long  as 
gj^  t 0,  6^  will  be  linearly  independent  of  g^, ...jg^  and  a new 

conjugate  direction  d^  can  be  developed.  If  g^  = 0,  for  some  k < n, 
then  the  minimum  is  located  in  fewer  then  n steps. 

Let  G = [gQ, . . .,gk_1],  D = [ dQ, . . •,dk_1].  Then  for  quadratics 

(i)  Each  direction  lies  in  the  space  spanned  by  previous  gradients, 

i.e.  -G  = DR  when  R is  upper  triangular — called  the  direction- 

gradient  relation. 

T 

(ii)  D AD  = a,  where  a denotes  a diagonal  matrix 

(iii)  A(xi-xi_1)  = (gj^-gj^  ^)>  i = 1,2, ...,k.  These  can  be  written  as 
ADA  = GH  where  A = diag(>Q  **•  A^),  is  8iven  by  (2-2> 
and  H is  a particular  upper  Hessenberg  matrix 
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(iv)  G G = (3,  p is  also  diagonal 
In  summary 

-G  = DR 
DTAD  = a 


AD*  = GH 
T 

GlG  = 0 


(2.3a) 


These  relations  can  he  manipulated  as  discussed  in  Nazareth  [6], 

All  three  algorithms  mentioned  above  are  the  same  for  quadratics. 

For  arbitrary  functions,  even  when  line  searches  are  not  exact,  successive 

search  directions  in  the  Hestenes-Stiefel  algorithm  are  "conjugate,"  i.e. 

T 

dj^y^  ^ = 0.  ^or  arbitrary  functions,  the  Hestenes-Stiefel  and  Polak- 
Ribiere  algorithms  are  the  same  when  line  searches  are  exact.  When 
applied  to  arbitrary  functions  and  line  searches  are  inexact,  the 
algorithms  differ,  and  Powell  [7]  has  explained  why  the  Polak-Ribiere 
variant  is  to  be  preferred.  The  reason  is  that  if  poor  search  directions 
are  being  generated,  and  successive  iterates  are  close,  so  g^  = g^  , 
then  °»  hence  the  search  direction  reverts  to  the  negative  gradient 

direction,  permitting  the  algorithm  to  recover. 
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It  has  been  shown  that  all  the  above  algorithms  have  an  n-step 
quadratic  rate  of  convergence  when  line  searches  are  exact,  Daniel  [8], 
Cohen  [9],  Polak  [10].  This  condition  on  accuracy  of  the  line  search 
can  be  relaxed  to  "asymptotically  exact,"  see  Kawamura  and  Volz  [11], 
Lenard  [ 12 ] . 

b)  If  the  starting  search  direction  is  not  along  the  negative 
gradient  direction  (-g^) , then  the  above  algorithms  do  not  usually  have 
finite  termination  on  a quadratic.  Indeed,  Powell  [13]  has  shown  that 
either  termination  occurs,  or  this  rate  of  convergence  is  linear,  the 
second  being  more  usual.  The  algorithms  can  be  modified  to  retain  the 
quadratic  termination  property  as  described  in  Section  2.2. 

c)  There  are  three  interpretations  of  the  conjugate  gradient 
method  (2.1),  as  applied  to  quadratic  functions,  which  are  of  value  in 
explaining  some  of  its  properties.  Each  interpretation  also  serves  as 
a good  staging  ground  for  extending  the  conjugate  gradient  method  and 
analyzing  the  resulting  algorithms,  as  we  shall  see  later. 


(l)  Interpretation  I:  Relationship  to  BFGS  method 

The  BFGS  method  [14]  is  currently  considered  to  be  the  most 
effective  member  of  the  Broyden  S-class  [14]  of  Variable  Metric  updates 

g 

(see  Appendix  1 for  a definition  of  the  3-class  and  dj) . When  the  c.g. 
method  or  any  method  of  the  S-class  is  applied  to  a quadratic  function,  with 


ic8 


- dr  - 


^ ■ -g^  and  line  searches  are  exact,  it  is  well  known  that  there  is 
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no  flexibility  in  the  choice  of  subsequent  search  directions  up  to  multi- 

cg  8 

plication  by  a scalar;  dj  and  d..  are,  in  consequence,  linearly 
dependent.  In  Nazareth  [15],  it  is  shown  that  the  member  of  the  8-class 
for  which  these  vectors  are  precisely  the  same  (i.e.  equal  in  magnitude 
and  direction)  is  the  BFGS  update.  The  result  is  also  taken  a step 
further  in  [15],  where  it  is  shown  that  for  arbitrary  functions  the  BFGS 
algorithm  may  be  interpreted  as  a conjugate  gradient  method  in  which  the 
metric  is  updated  at  each  step,  using  any  member  of  the  B-class.  This 
observation  leads  to  the  generalized  conjugate  gradient  methods  of 
Section  3,  where  we  shall  elaborate  further  upon  these  brief  remarks. 


(ii)  Interpretation  II:  Specialized  Gram-Schmidt 

Given  any  vector  g^,  and  a set  of  search  directions  d^, ...,d 

which  are  conjugate  w.r.t  A (orthogonal  in  the  inner  product  defined 

by  A),  then  vector  d^  which  lies  in  the  subspace  g^,  d^,  ...,d^.  ^ 

and  is  conjugate  to  d. ,...,d.  , will  have  a component  in  each  of  the 

x 

directions  d^, ...,dj  However  when  g^  is  the  gradient  vector  at 

Xy  then  g^  is  itself  conjugate  to  d^, ...,dj  Thus  the  conjugate 

gradient  method  can  be  viewed  as  a specialized  version  of  Gram-Schmidt 

where  the  vector  d.  is  chosen  to  lie  in  the  space  spanned  by  g. 

J J 

and  d which  is  conjugate  to  d . Note  that  these  statements 
J*-** 

require  that  line  searches  be  exact.  Later  we  show  that  when  line 
searches  are  not  exact,  another  specialized  version  of  Gram-Schmidt 
arises,  and  this  in  turn  leads  to  a very  natural  extension  of  the  conjugate- 
gradient  method. 
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(iii)  Interpretation  III;  Implicit  Lanczos 

This  Lanczos  process  is  a particular  generalized  Hessenberg 
process,  the  latter  being  defined  as  follows: 

Given  a matrix  A and  a set  of  n linearly  independent  vectors 
x^  which  are  columns  of  X,  and  an  arbitrary  initial  vector  g^,  develop 
vectors  g^  g^  ...  , gR  with  G ^ (gQ,g^> . . .,gn_^)  s.t. 

AG  = GH 


T 

G X = U 


(2.4) 


where  H is  upper  Hessenberg  and  U is  upper  triangular,  xi  need  not 
be  specified  beforehand.  If  they  are  taken  to  be  the  same  as  and 

if  A is  a symmetric  matrix,  then  (2.4)  becomes  Arnold's  method  or 
the  symmetric  Lanczos  method,  (see  Wilkinson  [16])  for  tridiagonalizing 
a symmetric  matrix,  the  latter  two  methods  being  equivalent.  This  is 
given  by 

AG  = GT 


T 

G G = a 


(2.5) 


T tridiagonal,  a diagonal. 

Note  in  particular  that  g^  £ [gQ,  AgQ, ...,A  gQ]  and  is  orthogonal 
to  [gg,  g^, ...,  where  [uq,...,u^]  denotes  the  subspace  spanned 

by  Uq,  ...,  u^.  It  is  not  difficult  to  see  that  the  c.g.  method 
implicitly  carries  out  the  above  process,  where  G is  identified  with 
the  matrix  of  gradients. 


If,  for  some  k,  Agk  € [gQ,  g^ ...,  gfe]  = [gQ,  AgQ,  , AkgQ] 
then  the  successive  gradients  gQ,  ...  , 6^+^  are  linearly  dependent. 

In  this  case  the  minimum  lies  in  the  subspace  spanned  by  dQ,  ...  , d^ 
(see  Nazareth  [17])  and  = 0.  When  A has  only  m < n distinct 

eigenvalues,  it  is  easy  to  show  that  the  Krylov  sequence 

2 

gQ,  AgQ,  A gQ,  ...  , has  only  m linearly  independent  vectors.  Thus 
the  conjugate  gradient  method  will  terminate  in  m steps. 

We  shall  later  show  that  the  Lanczos  process  and  an  associated 
Krylov  sequence  also  underlies  a member  of  the  conjugate  gradient  family 
of  methods,  called  the  three  term  occurrence. 

Some  convergence  results  very  closely  related  to  the  above 
interpretation  are  given  in  Luenberger  [18],  where  the  c.g.  method  is 
viewed  as  an  optimal  process  over  a space  of  polynomials. 

Given  an  arbitrary  starting  point  Xq  let 

Vi  " xo  + pk(A)8o 

where  Pk(A)  is  any  polynomial  of  degree  k. 

If  x*  is  the  optimum  point,  then 

g0  = A^x0  ‘ X*^  ( 2* 6) 

and 

(xk+1  - x*)  = [I  + APr(A) ] (Xq  - x*)  (2.7) 

It  can  be  shown  that: 


* 

J 
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(i)  E(xk+1)  = f(xk+1)  - f(x*)  =|  (x0-x*)TA[I+APk(A)]2  (xQ-x*)  (2.8) 


(ii)  The  conjugate  gradient  method  implicitly  selects  the  polynomial 


Pk(A)  of  degree  k for  which  Ek+1  is  minimized. 


(iii)  E(xk+1)  < max[l  + ] E(xQ) 


(2.9) 


where  the  maximum  is  taken  over  all  eigenvalues  of  A.  If  there 

are  m distinct  eigenvalues,  then  P can  be  chosen  so  that 

m 

E = 0 and  from  (l)  it  follows  that  the  conjugate  gradient 
m+1 

method  converges  in  as  many  steps  as  there  are  distinct  eigenvalues 
(as  was  seen  above  in  the  discussion  on  the  Lanczos  process). 
Luenberger  [18]  also  shows  that 


U_  t - \) 

E(x.)<— — ^E(x_),  0 < t < n (2.10) 

* <Vt  ♦ V 0 

where  \»***»^n  are  eigenvalues  of  A in  non-decreasing  order. 


2.2.  Generalizations 

Let  F be  the  strictly  convex  quadratic  function 

F(x)  = | (x-x*)T  A(x-x*)  (2.11) 

and  let  H be  a symmetric  and  positive  definite  matrix  (H  > 0).  H 

T 

can  be  factorized  in  various  ways,  e.g.  as  H = LL  , where  L is  lower 
triangular  and  nonsingular. 

Let  us  define  the  transformation  of  variables  (x-x*)  = Lz  and 
let  us  represent  the  function  in  the  z-space  by  h(z).  Then 
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(2.12) 


h(z)  = f (x*  + Lz)  = f(x)  = i zTLTALz 


and 


Vh(z)  = (LTAL)z  = L^Vf (x)  . 


Also  Vh(0)  = 0,  and  correspondingly  Vf(x*)  = 0. 

Suppose  that  we  apply  the  conjugate  gradient  method,  say  the 
Fletcher-Reeves  to  h(z).  Let  us  denote  VhCz^)  by  g^  and  search 

A A (Ji  A 

directions  in  the  z-space  by  d^.  Thus  g^  = L g^  and  d^  = Ld^. 
Then: 

A A 

d0  = “g0  - „ 


\ = -gk  + 


I'M 


Ml 


(2.13) 


sk-l 


Also  if  is  the  step  from  a point  x^  along  d^  which  minimizes 

f (x),  then  is  also  the  step  from  z = L (x^-x*)  along  d^  which 

ip  A fpA 

minimizes  h(zk)  * (g(xfc)  = g(zfc)  d^. 

In  the  space  of  the  original  variables  (2.13)  becomes 


ao  ■ -"«o 

lit  III 

\ + Vi 

K-A 


(2. 14) 


T 

where  ll'vlljj  = v ^v- 

We  call  (2.1U)  the  conjugate  gradient  method  with  metric  H 
or  preconditioned  conjugate  gradient  method  (Axelsson  [28]).  Some 
properties  of  the  method,  analogous  to  those  discussed  in  Section  2.1,  are: 
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T 

when  {cr^,  are  the  eigenvalues  of  L AL  in  increasing  order  and 


A ITT 

E(zt)  = - z L ALz  = E(x^),  and  so  the  conjugate  gradient  method  with 

metric  H satisfies  (2.15).  If  H resembles  A 1 we  can  expect  that 
T 

L AL  ’’approaches"  the  identity  so  that  the  function  values  will 
decrease  faster. 

Suppose  that  H was  obtained  from  a quasi-Newton  iteration,  which 
for  a quadratic  satisfies 

tyj  = Sj  , j = 1,2,. ..,t  (2.16) 


2.2.2.  Generalization  to  Arbitrary  Starting  Direction 


So  far  we  have  assumed  that  dQ  = -gg.  Beale  [20]  has  shown 
how  dQ  may  be  taken  to  be  an  arbitrary  starting  direction  and  the 
quadratic  termination  properties  of  the  conjugate  gradient  method 
retained  by  modifying  the  defining  relations  as  follows: 


dQ  given 


Vo 

di  = "gi  + Tt~  do 
Vo 

T 

. _ _ + Vk-l 

dk-lyk-l 


(2.19) 


dk. 


T 

Vo 

l + V k > 2 


Vo 


xk+l  = ^ + 'W  wliere  ^ Is  ch°sen  to  minimize  the 
function  along  d^. 


Equations  (2.19)  defines  a cycle  of  n-steps.  When  the  cycle  is  finished, 
a new  direction  is  defined,  and  a new  cycle  is  stated.  Note  that 

when  dQ  is  the  negative  gradient,  we  obtain  the  usual  conjugate 
gradient  method. 

It  is  easily  seen  that  Beale's  method  applied  to  a quadratic 
function  develops  conjugate  directions  and  find  the  solution  in,  at 
most,  n iterations. 

Also  it  is  easily  seen  that 

dl  G [g0’  V AV 

d2  G [g0'  Ag0’  V AV  A V 

and  in  general 

\ e (gQ,  •••  > Ak-1g0,do>A^o'  • • • »A  V* 
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Therefore 


t'V  * * *,dk^  ^ [&0> 


A Ak-1 

AgQ, . . .,A  gQ, 


V 


Ad- 


Akd0] 


(2.20) 


Let  E(x^)  be  defined  as  in  (2.8)  and  suppose  that 

°<*<  \<  \ < •••  < VkSb 

b < W s W>  i — <\ 


(2.21) 


where  the  V s are  the  eigenvalues  of  A. 

We  now  prove  the  following  result,  which  extends  (2.10). 


Theorem  2.1.  For  any  xQ  € ]R n,  but  x^,  x^,  . . . , xr  be  generated  by 
Beale's  Method  when  applied  to  the  quadratic  function  \|r(x)  = — (x-x*)  A(x-x*). 
Assume  that  (1q  is  such  that 


I <*r 


k-1 


■ >d^]  [ 8q>  Ag^,  ...,A  gQ,  d| 


Ad- 


Akd0] 


(1) 


(2.21a) 


for  1 < k < n.  Then 

E^Xk+l^  - (b+a)  [E(X0}  + d0A^X0  " X + 2 d0^  * 

Proof.  Beale's  method  is  a conjugate  direction  method,  and  therefore 
minimizes  \|r(x)  (also  the  function  E)  over  the  space  [dQ, . ,.,d^]. 
Consider  the  iteration 

xk+l  = x0  + Pk-1^®0  + Sk^d0  * (2.22) 

^Conditions  which  ensure  this  are  given  below. 

lU 


where  P and  Sk  are  polynomials  in  A of  degree  k-1  and  k 
respectively.  Thus  the  polynomials  that  minimize  E(xk+1)  are  those 
obtained  by  Beale's  method.  Now 

(^k+i  - x ) = (xQ-x*)  + Pk-1(A)  A(xq-x*)  + Sk(A)dQ  (2.23) 

= [I  + pk_1(A)AJ  (Xq-x*)  + sk(A)d0 

Choose 

sk  = (I  + pk_l(A)A]  (2.24) 

We  have 

E(xk+i)  < \ ((xq-x^)  + [I  + Pk-1(A)A]  A[I  + Pw(A)A](^.x*+^)] 
= \ (*Q-x  )T  A[I  + pk_1(A)A]2  (xQ-x*) 

+ + Pk_1(A)A]2  (xQ-x*)  + | djA[I  + pk.1(A)A]2  dQ 

(2.25) 


Let 


Then 


{e^  be  an  orthonormal 
* n 

(x  -x  ) = Z 5 .e 
i=l 


set  of  eigenvalues  of  A and  let 

and  pk_^(A)  = aQI  + a^A  +* * •+ak.1Ak“1 


Pk-l(A)(xO“x#) 


t1 +Apk_1(A)](x0-x*) 


ao  + 


n 

z 

i=l 


E5iei + 
Z liei(l 


5iei+‘’*+  “k-l  ^ 5ieiAi  1 

ao  Z !l'lV""afc.lE  5iVl 
* Vi  > 


(2.27) 


(2.28) 


(2.26) 
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Pk-1^A^Z  +APk-l^A^^X0‘X  ^ 

- “oV-%-ix? “n.ii:!i'i^1(1,0bV-\Ak)  (2-29) 


[I+APk_1(A)]2(x0-x*) 

= E + a0Zxiliei(1+“’+  ak-ixi}  +‘“+  Q£k-iZ5ieiAi(1  + “+  ak-ixi)  (2*50) 


A[I+APk_1(A)]2(x0-x*) 


= z ^iei\(i + a0\+’ * *+  Vixi)(1  + W-+  Vi^ 


= E + \Pk-l(Ai)]i 


(2.31) 


(xq-x  ) A[ I + APk_1(A)]2(xQ-x*)  = Z 6^(1  + Vk-l^V^ 


Let  dQ  = E"=1  then 


<1qA[ I + APk_1(A)]2(xQ-x*)  = Z I^A^l  + >iPk.i^i)J2 


(2.32) 


<$UI  + APk_1(A)]2dQ  = Z |2Ai(l  + \Pk.j.(\)]S 


(2.33) 


Therefore,  for  any  P. 


E(Vl)  — Z^2  *i  + ?iPi  + 2 Pi>  V1  + Vk-l(V]J 

< max  [1  + ^Pp.-^)]2  \ E(ft1  + 0^% 


(2.3*0 


or  alternatively 

E(xk+J  < max  [1  + AiPk_1(A1)]2[E(x0)+djA(x0-x#+|  dQ)]  (2.35) 

\ 

The  rest  of  the  proof  follows  as  in  Luenberger  [18], 

Observe  that  (2.20) 

[cIq ,<^1  £ [gQjAgQ,  jA*’1^,  d0,Ad0,...,Akd0]  (2.36) 

only  shows  that  the  first  space  is  contained  in  the  other.  In  order  to 
use  Theorem  2.1,  we  must  establish  conditions  which  ensure  equality 
of  the  two  spaces. 

(1)  Clearly  d^  = -gg  will  do,  which  gives  the  usual  conjugate  gradient 
method. 

(2)  Another  possibility  is  that  d^  is  an  eigenvector  of  A.  We  do  not 
know  these  eigenvectors.  However  this  suggests  that  we  consider 
Beale's  method  with  arbitrary  metric  H,  where  the  directions  are 
defined  by 

“k  * -"Sk  * Vk-l  * rka0 

^k  = gkHyk-l^dk-lyk-l 

if  k > 1 

(2.37) 

if  k = 1 


o 
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In  this  case  we  have  that 


[d0,...,dk]  € [HgQ, (HA)Hg0,...,(HA)k_1Hg0,d0,HAd0,...,(HA)kd0]  . (2.38) 

If  H was  obtained  by  a quasi-New+on  update  formula  using  d^,  then 

Hy0  = HAd0  = dQ  (2.39) 

so  that  dQ  is  an  eigenvalue  of  HA  and  (3.38)  will  hold  with  equality. 
Theorem  2.1  will  hold  (except  that  the  A now  are  the  eigevalues  of  a 
different  matrix).  Beale's  method  then  takes  advantage  of  the  eigenvalue 
distribution  of  A,  which  is  a very  desirable  property. 

In  short  one  would  implement  the  method  as  follows t 

(a)  Choose  any  <Jq  / 0. 

(b)  Find  H satisfying  HyQ  = dQ,  using  some  quasi-Newton  update  formula. 

(c)  Continue  with  Beale’s  with  metric  H,  given  by  (2.37). 

Notice  that  if  (2.39)  holds  then  3^  = 0 and  therefore  d^  = -Hg^.  So 
we  obtain  the  conjugate  gradient  method  with  metric  H.  For  inexact  line 
searches  or  general  nonlinear  functions  3^  / 0 and  a different  algorithm 
will  be  obtained. 


2.3.  Strategies 

Two  strategies  which  have  a great  impact  on  performance  are 

(1)  How  often  to  restart.  This  is  nicely  discussed  in  Powell  [7]. 

(2)  Initial  scaling  of  the  search  direction.  For  a good  discussion  of 
this  see  Shanno  [21]. 
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3.  Conjugate  Gradient  Methods  with  Variable  Metric 


An  important  generalization  of  the  conjugate  gradient  method 
(2.1)  is  based  upon  the  first  interpretation, see  p.  6 . Two  variants 

that  have  been  suggested  are  the  variable  storage  generalized  conjugate 
gradient  method  (VSGCG),  Nazareth  [15]  and  the  Interleaved  quasi-Newton- 
conjugate  gradient  method  of  Buckley  [22]. 


3.1.  Variable  Storage  Generalized  Conjugate  gradient  (VSGCG)  method, 
Nazareth  [15]. 

When  line  searches  are  exact,  it  is  shown  in  [15]  that  the  BFGS 
method  can  be  stated  as  follows: 


BFGS 

1 

,BFGS 


"-“A 


_H!j-igj 


yj-i^-iei 

T .BFGS 
yj-ldj-l 


BFGS 

j-1 


(3.1) 


BFGS 

whose  xL  and  ^ > 0 are  given,  x^+1  = xj  + > 

BFGS  R ft 

A = arg  min  f(x.  + Ad“  ) and  ITT  is  developed  from  using 

J ^ J J J J™-*- 

any  member  of  Broyden' s 3-class  (see  Appendix). 

By  comparing  (3.1)  and  (2.lU)  it  can  be  seen  that  the  BFGS  method 
can  be  interpreted  as  a conjugate  gradient  method  in  which  the  metric  is 
changed  at  each  step.  When  storage  is  limited  this  suggests  that  some 
simple  be  used,  e.g.  a diagonal  matrix  and  that  the  vectors  defining 

the  rank  1 or  2 updates  be  saved.  These  are  then  used  to  define  the 
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metric  which  can  vary  or  stay  the  same  from  one  iteration  to  another, 
depending  on  the  storage  available.  The  resulting  family  of  algorithms 


is  discussed  in  Nazareth  [15]  and  is  as  follows:  Let  {x^,x2,...}  be 

the  points  generated  and  (H^H^  ,H^  ,...)  the  matrices.  indicates 

■L  J2  J2 

that  it  is  the  third  matrix  and  that  it  was  generated  at  x.  . (N.B.  In 

J2 

practice  these  will  be  defined  implicitly  by  the  vectors  defined  in 
the  update  functions.)  d^  (H)  will  denote  a conjugate  gradient  at  x^ 
using  metric  H.  Let  T = {j^, j^,...}  be  the  set  of  indices  where  up- 
dates are  performed.  The  VSGCG  iteration  is 


U(Hi’  Sk-1’  yk-l}  if  k e T 
undefined  for  k ^ T . 


Here  U denotes  the  update  function  of  Broyden' s P-class  (see  Appendix) 
and  i € T is  the  integer  preceding  k in  T. 

\ " dkGK>  * 

Vi  = xk  + w 

Note  that  matrix  used  is  not  the  most  recent  one,  but  the  previous  one. 


Theorem  3.1.  Let  H be  any  symmetric  and  positive  definite  matrix. 
Then  the  VSGCG  method  with  exact  line  searches,  starting  with  = H 
has  the  quadratic  termination  property. 
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1 


Proof.  d1  = dGG(H),  d2  = dGG(H). 

By  the  orthogonality  properties  of  the  CG: 

= gjHg1  = g^Hg1  = 0 


g3d2  ~ g3dl  = g2c 


0 . 


Assume  that  d£G(H®  ) = d£G(H)  for  2 < k where  H®  6 {H^ 
m ml 

and  t < k-1.  (That  is,  we  are  assuming  that  it  is  equivalent  to  use  any 

of  the  previous  matrices  to  do  the  step. ) Also  assume  that 


= 0 


gk+l  dj  ° 


J = 1, 2, ...,k 


We  write  Broyden's  formula  as 


Vi  ~ + a^sj  VdV 


Then 


HJ  gk+l  ' Hj  gk+l  + aj  -lSj  -lgk+l  + bJ  -lyj  -1  HJ_  A+l 
m m-1  m m m m m-j. 


VlVl  = Vs®**1  * V2  VA+1  * V2V2"V2Vl 


nj^+l  = ^+1  + al8lgk+l  + ylHgk+l’ 
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We  have  deleted  some  superscripts,  for  simplicity.  By  the  induction 
hypothesis,  the  above  equations  give 


gk+l  " H6k+l,*,,,Hj  -2®k+l  ^k+l  * 
1 m 


Then 


gk+l  ^k+l  ' 
’'m 


y * 

m 


dk+l(Hjm)  _gk+l  + jT . \ 


Vk 

" “®k+l  ,T  \ Tt+r11' 

Vk 


So  the  induction  holds  and  shows  that  each  step  of  the  VSGCG  method  is 
the  same  as  a CG  step  with  metric  H.  The  result  follows  from  the 
quadratic  termination  property  of  the  CG  method. 


Observe  that  the  above  result  is  independent  of  what  member  of 
Broyden' s class  one  chooses.  This  is  in  sharp  contrast  with  the  Inter- 
leaved Method  where  BFGS  has  to  be  used,  as  discussed  next. 

Some  specific  updating  strategies  allowed  by  Theorem  3.1  are 
the  following:  (a)  updating  at  every  iteration,  (b)  resetting 

to  H after  a certain  number  of  steps,  and  (c)  resetting  to 

any  previous  matrix  H . 
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5.2.  Interleaved  Quasi -Newton-Conjugate  Gradient  Method,  Buckley  [221 

This  method  performs  QN  and  c.g.  steps  intermittently.  The  c.g. 
iteration  is  carried  out  with  the  metric  defined  In  the  previous  QN 
cycle.  The  metric  is  not  updated  during  the  c.g.  iterations.  Again  the 
quasi-Newton  updates  are  defined  by  rank  2 corrections,  and  they  will 
not  be  stored  in  matrix  form,  but  will  be  kept  individually.  Again  as 
in  the  VSGCG  method  as  many  corrections  are  retained  as  our  storage 
capacity  allows  us.  In  what  follows,  it  is  only  important  to  keep  in 
mind  that  during  each  QN  iteration  the  matrix  is  updated  and  during 
each  c.g.  iteration  it  is  held  fixed. 

An  interleaved  QN-CG  has  been  studied  by  Buckley.  In  [ 22]  be 
states  a theorem  (see  Theorem  3.3  below),  which  we  generalize  slightly 
to  show  that  it  does  not  depend  on  the  use  of  conjugate  gradient  iterations. 
Consider  the  general  iteration 


dj  = “^j  + ffjdj-l 


Xj+1  Xj  + ajdj 


s 


is  any  constant 

= 1»2, . . . 


(3.5a) 


*0  = -*o 


For  the  case  of  general  nonlinear  functions  we  can  ask  if  the  VSGCG 
method  will  produce  nonsingular  matrices,  preferably  positive  definite 
too.  We  use  the  following  result  of  Powell  [25]  for  Broyden' s P-class. 


i 


Theorem  5.2.  Let  be  symmetric  and  positive  semi-definite,  let 

s^  / 0 be  in  the  space  spanned  by  the  columns  of  H and  let  y^  be 
T 

such  that  > 0*  Also  assume  that  vector  w^  is  nonzero.  Then  if 
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f 


p > 


(ykVk)(Vk)2 


Qk(ykHkyk)(skHksk)  - (Vk)2 


= p 


(5.2) 


we  will  have  that  rajik(H^+1)  = rank(H^)  and  will  be  positive 

+ 

semidefinite.  H is  the  generalized  inverse  of  H.  It  can  be  shown 
that 

P = t‘PBFGS  + (l_t)PDFP  > P*  for  6 [0’1]  (3.3) 

T 2 

where  PgpQg  = (ykHyk/o^)  gives  rise  to  the  BFGS  method  and  PDFp  = 0 
to  the  DFP  method. 

If  is  positive  definite,  any  will  be  in  its  column 

T 

space;  and  all  we  have  to  insure  is  that  s^y^  > 0.  This  can  always 
be  done  by  performing  a sufficiently  accurate  line  search,  see  [4], 
will  then  be  positive  definite  and  the  VSGCG  will  be  a descent 
method  where  H is  a symmetric  and  positive  definite  matrix  and  a 

J 

is  a steplength.  Now  consider  the  interleaved  method  that  uses  iteration 
(3.4)  and  QN  steps  from  Broyden' s (3-class  (A-l).  Let  the  points  generated 
by  this  method  be 


x0  = *0,1"  X0,2’ ’ " **  X0,FQ~  Xl,l’  Xl,2* * * * * X1,Fl"  X2,1,***,XR,] 


where  the  QN  steps  were  performed  at  x,  - , i = 0,1, . . . . H is  the 

i,J-  i 

new  matrix  obtained  at  x.  ..  and  d the  displacement  from  x 

i>i  1>J  ij 


to  x 
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Now  we  shall  see  that  in  the  interleaved  method  one  obtains 
positive  definite  matrices  for  any  P satisfying  (3.3). 

Lemma  3.1.  Let  be  symmetric  and  positive  definite.  Consider  the 

interleaved  method  that  uses  QN  updates  and  the  general  iteration  (3.1c). 
If  sufficiently  accurate  line  searches  are  performed  (see  below)  one 
can  choose  £ £ (i,l,  i,2, . . . ,i,F-l}  such  that  the  following  is  true: 

If  Sj^  t and  y^  ^ are  used  in  (A.l)  to  obtain  H^+^,  and  if  w^^  ^ 0, 

P > p*  then  will  be  positive  definite. 


Proof. 


di ,£  = ”Hi®i,  l + CTi,idi,i-l 


(3>) 


Since  is  nonsingular  dj^  ^ is  in  its  column  space.  Let  us  drop 

the  first  index,  i.e.,  d^  s d^.  Now 


dl  = _Hgl'  dlgl  < 0 

d/gi  = _giHgl  + °id/-lg/ 

Let  the  accuracy  of  the  search  be  such  that 

ldJVl'  < "ln  i^J1!  f"*11  J <5-6> 

Then  from  (3.5),  d^g^  < 0 and 


25 


a£*i  ‘ aXe«*i  - dIe«J  > 0 • 

T 

Therefore  y£  > 0.  We  assume  that  > 0 for  all  j.  Applying 
Theorem  3.2  we  conclude  the  proof. 

The  lemma  tells  us  that  we  can  use  any  of  the  general  iteration 
steps  to  do  the  next  quasi-Newton  update  and  retain  positive  definiteness. 
In  order  to  obtain  termination  for  a quadratic  we  will  have  to  use  the 
last  step. 

Algorithm  3.1.  Consider  the  interleaved  method  that  uses  the  general 
iteration  (3.3a)  and  which  satisfies  the  following: 

(a)  in  the  previous  iteration  to  an  QN  step  an  exact  line  search 
is  employed. 

(b)  for  all  other  iterations  we  do  sufficiently  accurate  line  searches 
(in  the  sense  of  Lemma  3.1). 

(c)  the  QN  updates  use  the  last  displacement  vector. 

The  following  theorem  says  that  the  intermediate  steps  do  not 
undo  the  progress  of  the  QN  steps. 

Theorem  3.3.  If  Algorithm  3.1  is  applied  to  the  quadratic  function 
♦ (x)  = g (x-x  ) A(x-x  ),  starting  from  any  x^  then  the  minimum  x 
will  be  reached  after  r QN  steps,  where  0 < r < n. 
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Proof.  Let  Sik  = (v.AH^v  = v,  v H4g,  ^ = 0,  j = 1,2, ...,k).  First  we 


show  that 


l°l,j 


3lk  - 3l,kn  for  1Sl<fl  ' 


(5.7) 


Let  v € S 


i,K 

k 

xi,k»i ' xi,k  * ^ for  ■°"e  con,t*nt8  3j 

k 

gi,k+l  = gi,k  + ^=1  PJAHi6i,J 
v Higi,k+1  = v Higi,k  + ^ Pjgi,JHiAHiv 

= j,  pjgLv  - 0 • 

Therefore,  S^CS^^. 

Next  we  will  show  that  S.  „ , C S, To  simplify  the 

— i+i,i 

notation  we  will  drop  the  subindex  of  F,  i.e.,  x.  v = x.  _ . Let 
v e Si  F-l  then 

T 

AH^v  = v,  v Hjg^  j = 0,  j = 1,2,..., F-l  . 


Let  us  call  5 = 8i  F_1  and  r = F_1 

H rrTH  --T 

H.  ..  = H.  - + 0wwT 

i+1  i T„  IT 

r Hr  6 r 


w = a 


V 


5 


T„  _T 

r HAr  6 r 
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(3.8) 


AH.  H^H  v .rrT 

AH  -v  = AH  v + — Tfr-^  + B^wa 

11  1 r^r  b v 


r^H v t 

i b V 

T " T 
rHjT  6 r 


= v - AHl^  AH-^  + + BAwa 

r^r  5 r 


6 AH. v T 
i 6 v 

T " T 

r 5 r 


(5.9) 


T T 

Now  5 AH^v  = 5 v 


F-l 


6 ■ £ VA.j 


therefore  5 v = 0 . 


Using  this  in  (3.9)  we  obtain 


AHi+lV  = V * 


(3.10) 


T 

We  only  need  to  show  that  6^+iHiv  = ® 

F-l 


*1*1,1  ■ *i,F-l  + £ e3Hlei,j 

F-l 

gi+l,l  = gi,F-l  + 

F-l 

gi+l  Hi+lV  " gi,F-lHi+lV  + Pjgi, jHiAHi+lV 

= gi,F-lHi+lV  + ^ Pjgi,jHiV 

(from  3.10) 

= gi,F-lHi+lV 

= gi,F-lA  AHi+lV 

T -1 

= g,  „ ,A  AH.v  = 0 

6i,F-l  i 

(3.11) 
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Therefore  Si  f,_1  C Si+1  Note  that  we  have  not  used  the  exact  line 
search  hypothesis. 

Finally,  we  will  show  that  if  F was  obtained  by  doing  an 
exact  line  search  from  Xj^  p_1  then  C y From  the  quasi- 

Newton  equation,  5 = Hi+1r>  therefore 


AHi+lr  = A5  = r 


T 

5 i = ® (exact  search)  , 


(3.12) 


(Hi+1^  Hi+lgi+l,l  “ r Hi+lgi+l,l  ~ 0 ‘ (3.13) 


Hence  r £ Si+1  y Assume  that  r £ Si  f_1» 


Therefore 


gi,F-iV  = 0 ’ 


A HjT  = r 


T -1  T 
gi,F-lA  * = gi,F-l&  = ® * 


(3.1M 


This  is  not  possible;  in  the  proof  of  Lemma  3.1  it  is  shown  that  g^d^  < 0 

for  all  j.  This  shows  that  Si  p ^ C Si+1  So,  the  linear  spaces 

S.  . are  nondecreasing  and  their  dimension  increased  by  at  least  1 

l,  K 

after  the  QN-step. 


It  is  clear  from  the  proof  that  we  could  do  a QN  after  a step 

with  inexact  line  searches  and  this  would  not  destroy  the  termination 

property.  Theorem  3.3  can  be  paraphrased  as  follows:  after  at  most  n 

QN  steps  that  were  preceded  by  exact-line  search  steps  we  will  reach 
# 

the  minimum  x . The  BFGS  update  formula  is 


- /•» 


(3.15) 


Vl  = Hk 


+ ( 

S,  Yr.  ' 

k^k 


ykHkyk 


skyk 


Vk 


S,_  - 


vX 

T 

skyk 


which  we  will  write  as 


Vi  - \ * Vt  * vK 


0.16) 


where  the  vectors  a^  and  b^  are  found  by  comparing  with  (3.15). 

Theorem  (3.7)  is  a weak  result;  we  shall  now  show  that  these 
iterations  can  be  chosen  so  that  (n-step)  quadratic  termination  is  obtained. 


Algorithm  3.2.  Consider  the  Interleaved  method  that  uses  BFGS  steps 
and  conjugate  gradient  iterations  with  respect  to  the  newest  Hg^g 
matrix.  An  exact  line  search  is  performed  at  each  step. 


Theorem  3.4.  If  Algorithm  3.2  is  applied  to  the  quadratic  function 

*(x)  = | (x-x*)T  A(x-x*)  starting  from  any  xQ  € Rn  with  any  symmetric 

* 

and  positive  definite  HQ,  then  the  solution  x will  be  obtained  in  at 
most  n steps. 


Proof . The  Hestenes-Stiefel  CG  with  metric  H is 

T 


*>>  ■ -"“j  * [ / T1  ] sj-i 


(3.18) 


Let  x,  be  a point  generated  by  the  algorithm.  We  define 

J 
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j = displacement  generated  by  Algorithm  3.2  at  x ^ 

di,l  = "Higil  ' i = 2,3, R 

CG 

d.  = (3.18)  evaluated  at  x . 

1>  J i,J 

We  are  using  here  the  same  notation  as  in  Theorem  (3.3)  to  describe  when 

CG 

the  updates  are  done.  We  will  show  that  d = d.  (H ) for  all  (i, j) 
in  the  sequence.  In  other  words,  the  Interleaved  method  is  equivalent  to 
using  the  CG  with  metric  Hq  throughout. 

Assume  that  at  the  (j-l)  cycle: 


(1) 

^s,k 

■ Ck<v  * 

~ o, . . . , j-ij  k 1,...,FS| 

1 

(2) 

Hsgj-l,t 

= HOgi-l,t  ’ 

S ~ 0 , . . . , J “2  $ t = 1, . . . , F 

s 

f (3.19) 

(5) 

HJ-lgJ-l,t 

= V'j-ljt  ' 

t = 2,3,...,  > 

• 

The  assumption  is  clearly  true  for  j = 1.  Now  we  show  that  (3.19)  holds 

for  the  j-th  cycle.  Recall  that  g.  ..  „ . = g 

i-l,Fj-l  j,i 


J,1 


" HJgj,l  ‘ aj-lSj-l,F-lgj,l  + bj-lyj-l,F-lHj.lgj,l 

" HOgj,l  ' bj-lyj-l,F-lHOgj,l 
T 

- Vj,!  + -^'■ljF~lH°g^1  ^ (see  (3.15)-(3.16)) 


Sj-l,F-lyj-l,F-l 


(5.20) 
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: ^ 


Using  the  conjugacy  and  orthogonality  properties  of  the  CG  we  have 


Hlgj,2  ”0^,2  + a0s0Fgj,2  + b0'^0,FI\)gj,2  ^0gj,2 


Assume  that  H^g^  2 = HQgj  2 for  1 < k < j,  then 


^+1^,2  " ^^,2  + Vk,Fgj,2  + V^F*^8  j,2  H0gj,2 


Theri 


efore  Hk8j^2  = Hogj,2*  k = 


Now 


y]  1 H -i®  -i 


dCG  (H  ) = - H g + Ji1  .j  .Hi2  d 
j,2^  V 3*3,2  T . j,l 


yj,ldJ,l 

Hg  +ilH 

Og3,2  T 
y3, 1 

d5?2(H0)  * 


j-1 


Reasoning  as  before  we  can  show  that  for  k = 1,2, ...,j 


HkSj,  i Vj,i  ) 

j i = 2,3,..., 


d5?i<V  - d5?i(Ho) 


(3.21) 


So  that  (3.19)  holds  for  the  (j-th)  cycle.  The  proof  now  follows  from 
the  quadratic  termination  of  the  CG  method. 
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If  instead  of  BFGS  one  uses  some  other  member  of  Broyden' s 


1 


P-class  (A.l),  does  one  obtain  (n-step)  quadratic  termination? 

Lemma  3.2.  If  in  Algorithm  3.2  one  uses  a member  of  Broyden* s Class 
(A-l)  different  from  BFGS,  then  the  quadratic  termination  property  is 
lost. 

Proof.  We  write  (A.l)  in  a slightly  different  form 

H = H-^  + Hl+e(Hy  .is)(Hy.Lls)T  (3.23) 

yHy  sy  \ s y / \ sy/ 

/ T " 

where  = 1/y  Hy. 

Suppose  that  n-1  CG  steps,  (n-2)  with  metric  H have  been 
performed  and  that  the  solution  has  not  been  reached.  We  now  update  the 
matrix,  which  so  far  has  remained  unchanged.  From  Theorem  (3.*0  it 
follows  that  if  the  solution  is  to  be  obtained  in  the  next  step,  the 
new  direction  should  be  parallel  to  the  BFGS.  Let  s be  the  last  dis- 
placement and  g+  the  current  gradient.  Recall  that  we  are  doing  exact 
line  searches: 


T 

s g.  = 0 


"dBFGS  118 


/ yTHg+  yTHg  \ / yTHg  \ 

+ + Hy(-^-  + ^— )-s(  — j 

\ y Hy  y Hy  / \sy  / 


= Hg+  - 


T 

y Hg+ 
T 

s y 


Suppose  that  k'dBpQs=dB  ^or  some  numl:>er  k.  Then 

/ yTHg+  ? \ / yTflg+  \ T 

(k-l)Hg+  + Hy  ^ - Py  Hg+j  - s^-^ J (k  - Py  Hy)  = 0 . 


This  equation  has  the  form 

a1Hg+  + agHg  + = 0 . 

As  s = L where  the  summation  is  over  all  i such  that  g^ 

g+,  and  as  the  gradients  are  conjugate  with  respect  to  H we  have 
a^  = a^  = 0.  But  a ^ = 0 implies 


k = Py  Hy} 

T 

y Hg 

a,  = (k-l)  + - pyTHg+  = 0 ; 

y Hy 

T 

m m y Hg 

Py  Hy  - 1 = Py  Hg  - — j 

y Hy 

T T T 

T T y Hg  y Hy  - y Hg. 

P(y  Hy  -y  Hg+)  = 1 - — 


y Hy 


y Hy 


Therefore  P = (yTHy)-1  which  is  BFGS. 


(3.25) 


(3.26) 


precedes 


(3.27) 


3*+ 


The  above  results  show  again  the  special  relation  between  the  CG 
and  the  BFGS  methods  discussed  in  Interpretation  I p.  6 . The  proof  of 

Theorem  3.  ^ implies  the  following:  If  BPGS  and  CG  are  applied  to  a 
quadratic  function  using  the  same  initial  matrix,  and  if  exact  line 
searches  are  used  then  the  displacements  generated  by  the  two  algorithms 
are  the  same  in  direction  and  magnitude. 
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4.  Conjugate  Gradient  Methods  with  Inexact  Line  Searches 


In  this  section  we  discuss  variants  of  the  conjugate  gradient 
method  which  have  their  basis  in  the  interpretations  discussed  in 
Section  2 on  p.  6 to  10  . The  aim  of  each  variation  is  to  drop  the 
requirement  that  line  searches  be  exact,  and  still  retain  the  advan- 
tages of  the  basic  conjugate  gradient  method. 

We  shall  give  a brief  description  of  each  algorithm  in  Section 

4.1.1.  We  then  discuss  properties  of  these  algorithms,  concentrating 
on  the  three  term  recurrence. 


4.1.  Basics 

4.1.1.  Algorithms 

a)  Dixon’s  gradient  prediction  method  [24] 

The  first  method  along  these  lines  was  due  to  Dixon.  His  exten- 
sion of  the  conjugate  gradient  method  is  based  upon  the  following  result: 


Lemma  4.1.  Given  an  initial  point  xQ  and  a set  of  conjugate  directions 
<Jq,  d^,  ...  , d^  s.t.  <1q  — -g q and  d^  £ [Sq,  • ••>  ® 

sequence  of  points  x^,  x^,  ...  , xj+^  developed  s.t.  xj+^  = xj  + ^j^j’ 


where  is  an  arbitrary  step. 


Then  the  gradient  g1+1  at  the  minimum  point  x^+1  in  the 
affine  space  [Z  = [z:z  = xQ  + Zj_q  ajdj>  a e ®)  can  be  deduced  from 

g ^ Vf(x.)  and  the  search  directions,  and  is  given  by 

J J 
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and 


(4.1) 


Proof.  Implicit  in  Dixon  [24], 


can  be  used  to  develop  a search  direction  d^+,  parallel 
to  that  developed  by  the  c.g.  method.  From  (2.1)  this  is  achieved  by 


d 


i+1 


d. 


l 


(4.2) 


From  (4.1)  it  should  be  clear  that  only  two  additional  vectors  are 
needed  to  accumulate  the  corrections  to  g^+^  and 


b)  Meinoryless  Quasi-Newton  Methods,  Shanno  [21] 

These  are  based  upon  the  first  of  the  three  interpretations  in 
Section  2.1.2,  and  the  subsequent  discussion  of  Section  3.1.  It  should 
be  clear  from  (3.1)  and  (2.1)  that  the  restarted  BFGS  algorithm 
<<i  ■ T>  is  equivalent  to  the  Hestenes-Stiefel  conjugate  gradient 
algorithms  (2.1)  and  (2.3),  when  line  searches  are  exact.  Shanno 
carries  this  further  by  dropping  the  requirement  that  line  searches  be 
exact,  and  developing  search  directions  by 
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A number  of  additions  to  this  basic  algorithm  contribute  to  its  effective- 
ness— in  particular  strategies  for  scaling  and  restarting.  For  details 
see  [21]. 


c)  The  multistep  method,  Nazareth  and  Nocedal  [23] 

This  is  based  upon  the  second  of  the  three  interpretations  of 

Section  2.1.2.  As  noted  there  the  c.g.  method  develops  search  directions 

and  gradients  which  satisfy  (2.3a).  Let  us  consider  dropping  the  require- 

T 

ment  that  line  searches  be  exact  and  hence  the  fourth  relation  G G = P, 

but  let  us  still  insist  that  the  direction  gradient  relation  be  satisfied 

T 

-G  = DR,  and  that  directions  are  conjugate,  i.e.  D AD  = a.  In  addition 
the  matrix  H in  the  relation  ADA  = GH  must  be  redefined  as 


-1 

1 -1 


u 

1 


» e k (U.4) 


^n-l 

l n 


since 


gn+i  e tgi,...,gn]  and  gR+] / 0 in  general. 

The  multistep  method  is  based  upon  the  following  result,  given 


in  [25]. 
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Lemma  4.2.  Given  matrices  G,  D,  R,  H,  A,  a as  defined  above,  then 
R has  the  form: 


1 ® a a a 

1 ® P P 

l ® r 

• • 

R = 


where  elements  denoted  by  the  same  greek  letter  are  equal,  and  <g> 
denotes  an  element  which  is,  in  general,  non-zero. 

It  is  clear  that  for  quadratics  we  have  another  specialized 
version  of  Gram-Schmidt  orthogonalization.  Mainly  for  purposes  of 
illustration  a particular  algorithm  is  suggested  in  [25],  but  a number 
of  alternative  formulations  come  to  mind,  and  it  is  as  yet  unclear 
how  to  make  effective  use  of  Lemma  b.2  in  an  algorithm  for  non-linear 


optimization.  The  salient  point  however  is  this:  The  usual  c.g. 
method,  e.g.  Hestenes-Stief el  develops  a search  direction  in  [-gyd^ 

that  is  orthogonal  to  y . Lemma  k.2  suggests  that  it  might  be  worth- 

J 

while  to  maintain  a second  vector  c^.  composed  of  a suitable  linear 
combination  of  previous  directions,  and  an  associated  change  of  gradient 
fj_^,  and  to  develop  d^  € [ , dj_i>  and  orthoB°nal  to  Jrj_1 

and  tA  , . 
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J 


d)  The  three  term  recurrence,  Nazareth  [171 
As  we  shall  see,  in  the  following  section,  the  method  is  best 
thought  of  as  another  implicit  Lanczos  process  for  symmetric  matrices  (see 
p.  8).  The  defining  sections  for  these  three  term  recurrences  are  given  by 


J+l 


"yj  + 


j > o 


M) 


Xj+2  = xj+l  + ^jdj  when  A^  is  a function  reducing  step,  but  not 

necessarily  to  the  minimum  of  the  function  along 


0 ‘ 


Also  d ^ = 0 . 


Details  of  the  algorithm  are  given  in  [17].  Computational  experience 
reportsa  in  [27]  is  encouraging.  Shanno  [26]  however  does  not  obtain 
good  results  with  the  method.  A recent  hybrid  implementation  by  Gill 
and  Murray  [27]  combines  the  TTR  with  the  conjugate  gradient  method 
and  they  also  report  encouraging  computational  experience.  As  we  shall 
see  in  the  next  section,  the  TTR  method  has  certain  advantages  and  dis- 
advantages viz  a viz  the  CG  method,  and  what  is  clear  is  that  an 
effective  implementation  must  exploit  the  positive  aspects  and  circumvent 
the  negative  aspects  of  the  TTR. 


UO 


4.1.2.  Properties  and  Interpretations 

a)  When  line  searches  are  exact  then  the  gradient  prediction 
method  is  identical  to  the  conjugate  gradient  method.  This  holds  for 
arbitrary  functions.  This  property  also  implies  that  line  search 
criteria  can  be  found  which  ensure  that  the  GP  method  develops  descent 
directions,  see  Shanno  [26].  The  GP  method  has  finite  termination  on 
quadratics. 


b)  The  memory less  BFGS  method  does  not  have  finite  termination 

on  quadratics.  However,  by  virtue  of  it  being  a one  step  variable  metric 

method  it  is  clear  that  it  develops  descent  directions,  subject  to 
T 

> 0.  This  can  be  assured  by  the  line  search. 

c)  A straightforward  implementation  of  the  multi step  method, 
based  upon  Lemma  4.2,  as  described  in  [25],  will  not  assure  descent  for 
arbitrary  functions.  Such  a method  will  retain  quadratic  termination. 

As  noted  earlier,  we  believe  that  an  algorithm  which  exploits  Lemma  4.2 
in  a more  subtle  manner,  may  be  a very  useful  contribution. 

d)  We  now  discuss  a number  of  new  properties  of  the  TTR. 

The  first  observation  of  some  inyaortance  is  that  the  TTR  does  not 
require  that  dQ  be  along  g^,  in  order  to  develop  conjugate  directions. 
This  is  an  advantage  since  it  permits  restarts  of  the  algorithm  with 

d0  ^ ~g0*  * disadvantage  of  TTR  however  is  that  for  arbitrary  function 
dj+^  need  not  be  a descent  direction,  even  when  line  searches  are  exact. 

The  TTR  is  closely  related  to  the  Lanczos  process  for  tri- 
diagonalizing  a symmetric  matrix.  Since  y..  = Ad^  we  can  write 
(4.6)  as 


4l 


Define  D = (dQ, . . . , d^.)  t < n where  d^  ...,dt  on  the  set  of  conjugate 
directions  developed  before  the  algorithm  terminates,  i.e.  d^+1  = 0. 

Then  (4.7)  for  j = 0,1, ...,t  becomes 


and  by  conjugacy 


AD  = DT 


T 

D AD  = a 


(4.8) 


where  T is  a tridiagonal  matrix  and  a is  diagonal,  (4.8)  can  be 
rewritten  as 


D AD  = T 


D D = a 


(4.9) 


where  D = A^^D  and  T = CKT  is  symmetric  and  tridiagonal.  (4.9) 
defined  the  Lanczos  process. 

It  also  follows  directly  from  the  above  discussion  that 


V 


Id, 


Ad0' 


Thus  the  TTR  will  terminate  when  the  above  Krylov  sequence  d , Ad^, . . . 
attains  maximum  rank.  If  A has  only  k distinct  eigenvalues,  then 
there  can  be  at  most  k steps.  If  d^  = -gQ  then  the  TTR  with  exact 
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or  inexact  line  searches  will  generate  conjugate  directions  which  span 


the  same  spaces  as  would  the  CG  method  with  exact  line  searches.  When 
the  sequence  terminates,  the  correction  step  will  be  to  the  minimum. 

If  dg  / -gQ  then  the  TTR  can  terminate  (say  d^  = 0) , but  the  minimum 
need  not  lie  in  the  affine  space  (z;z  = xQ  + ^ 1 <Xdj,  cr . € 3R  } . This 
is  a disadvantage  of  the  method.  We  can  however  show  the  following. 


Lemma  4.1.  If  the  TTR  with  <1q  / -g^  and  exact  line  searches  terminates 
prematurely  at  x^,  k < n,  i.e.  gCx^  / 0 and  ^k  = ^hen  bC^) 
conjugate  to  d^, . . . , 

Proof.  gk  is  orthogonal  to  [ dQ, . . . , d^]  because  d, ...,  dk_1  are 
conjugate  and  line  searches  are  exact.  Also  by  (4.6)  y^  £ [d^dj, . . . ,d^+^]. 
Thus  y0,  ...  , yk_2  e [dQ, ...,  dk_1)  and  yk_1  £ [dQ, . . d^]  because 
the  process  has  terminated,  i.e.  = 0.  Therefore  [yQ, . . .^y^] 

= ( dQ, . . . , d^].  Thus  gk  is  orthogonal  to  [yQ,...,  y^],  i.e.  is 
conjugate  to  d^, ...,  <3k_^. 

Suppose  we  drop  the  requirement  that  line  searches  be  exact.  The 
next  lemma  shows  that  in  the  quadratic  case  a vector  can  be  maintained, 
which  permits  a restart  of  the  algorithm  when  premature  termination  occurs. 


Lemma  4.2.  Define 


n0  =g0 

n j € [ n and  orthogonal  to  y^. 


Then  is  orthogoanl  to  yQ, ...»  ^ and  n^  is  conjugate  to 


'O’  * * * * *J-2* 


Proof.  By  Induction,  suppose  n.  is  orthogonal  to  yn,  ...  , y . 

Also  by  conjugacy  d.  is  orthogonal  to  yn,  ...  , y, 

K U K-l 

Since  n^+^  £ [n^, d^]  and  is  chosen  to  be  orthogonal  to  y^, 
it  clearly  follows  that  n^+1  is  orthogonal  to  yQ,  ...  , y^.  Thus 

T 

njAfdQ,  ...  , dJ_1]  =0 
T 

(Anj)  = 0 

Since  [yQ,  ...  , y^g]  e [<3q>  > d^_^]  =»  An^  is  orthogonal  to 

y0>  ...  > ^j-2* 

The  above  lemmas  suggest  ways  of  modifying  TTR  in  order  to  circumvent 
its  disadvantages  discussed  above.  In  particular.  Lemma  i+. 1 justifies  the 
hybrid  implementation  of  Gill  and  Murray  [27],  and  demonstrates  that 
termination  of  this  implementation  will  occur  in  at  most  n steps  from 
xQ  (N.B.  not  Xjj) • We  will  however  defer  a more  detailed  discussion 
of  TTR  modified  along  the  lines  suggested  by  Lemma  4.1  and  4.2  so  as 
not  to  unduly  lengthen  an  already  long  paper. 

It  is  also  clear  that  methods  GP  and  multistep  discussed  in  the 
section  can  be  generalized  to  arbitrary  fixed  metric  and  arbitrary 
starting  directions,  and  that  much  can  be  said  about  strategies  of 
scaling  and  restarting.  Again  we  do  not  pursue  this  here. 


5.  Inexact  Line  Search  Methods  with  Variable  Metric 

In  Section  3 we  discussed  conjugate  gradient  methods  in  which 
the  metric  is  varied.  Similar  ideas  can  be  applied  to  the  methods  of 
Section  U.  Here  we  discuss  only  one  such  method,  which  seeks  to  avoid 
line  searches  in  the  conjugate  gradient  steps  and  retain  quadratic 
termination  by  using  the  TTR  method. 

Algorithm  3.1.  Consider  the  Interleaved  method  that  uses  BFGS  and  the 
TTR  method  in  the  following  way:  (a)  at  the  end  of  a sequence  of  TTR 
steps,  the  correction  step,  see  [17]  is  done,  and  (b)  every  BFGS  step 
is  performed  with  an  exact  line  search. 

Theorem  3.1.  If  Algorithm  5.1  is  applied  to  the  quadratic  function 
\|f(x)  = ± (x-x  ) A(x-x  ) starting  from  any  xQ  € Rn  and  any  symmetric 
and  positive  definite  matrix  Hq,  then  the  solution  will  be  obtained  in 
at  most  n-steps.  (As  the  correction  step  does  not  involve  function 
evaluations  it  is  not  counted  as  a step. ) 

Proof.  First  we  will  show  that  the  directions  generated  by  the  TTR 
are  parallel  to  the  conjugate  gradient  directions  of  dp  ■ -gg. 

* * 

We  will  denote  by  x^,  x 2,  ...  the  sequence  generated  by  the  CG 
and  Xp  x2,  ...  that  produced  by  the  TTR  method. 


* 


therefore 


gl  = gl  + (Vl)y0  = Vo  + V yO  = Vo 
yoTgi  = Votgi  + Vo  - yo] 

*T  T 

yo  ^ = Vodo  * 


Using  the  Hestenes-Stiefel  CG  we  have 


di  * "Vo  ' «o  + °b 


+ Q lyogi  * Voao  - VqI 


aoyodo 


do 


Vo  + 


T ^ T T 

, A yogi  + aoyoyo  - yoyo 

r 

yodo 


■Vo  + 


p T T T 

yoyo  Voyo  ~ y0y0 

T. 

yodo 


yoyo 


Vo  + aO  T <*0  ~ aodl 


where  d^,  dg,  ...  sure  given  by  (4.6). 

Induction:  Assume  that  d*  = a d , i = 1,2,...,  k 

J J-l  J 


From  (4.1) 


* £ gnid1 

Vl ' Vl ' A yJ 


(5.1) 
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T J 


T. 

* gkdk 

yk  = Vk  wlth  °k  = ‘ “P  • 

Vk 


*T  * 

1 * A 1 yk  yk 


“>  ** * "*  >wii  ^ 


*x  * 

1 yk-lyk 


4 [- 


yk  yk  ,* 


8k+l  + ®k  + *T  * \ 
yk  ^ 


*T  * 

+ yk-lyk  * 

*T  * “k-l 

V-A.i  J 


*T  * 

* yk-l®k  * 

?k  + *T  * \-l  * 

^-A-i 


A*  + Yk  Yk  ** 

_dk  *T  * \ 

yk  \ 


_ *T  * *t  * 

yk  **  + yk  yk  * 

«57 \ * 

- yk  ^k 


Using  (5.3)  and  the  orthogonality  of  gradients 


*T  * *T  * * *t  * 

_ yk  gk  + yk  gk+l  ykgk  yk  ®k-l  * 
*T  * \ = *T  * \ 


yk  ^ 


yk  dk 


Substituting  (5.3)  in  (5.2)  and  using  (5.4)  and  orthogonality 


(5.2) 


(5.3) 


(5.4) 
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-g 


k+1 


- dk  + 


*T 

'k- 


y^lgk 


*T 


-1^-] 


dk- 


*T  * 
yk  yk 
*T  * 

yk  dk 


*T  * 

yk-iyk 

* 


* 

Vi 


* 

* 

*T  * 

1 

* t y*  s*+i  a* 

+ _L 

yk-lgk 

* . yk-lyk  * 

“k 

®k+l  *T  * \ 

yk  \ 

* 

/k-l^-l 

\-l  *T  * dk-l 

yk-l\-l  J 

*T  * 

1_  * yk  ^+1  * 
a “®k+i  *T  * dk 
^ yk  \ 


Induction  holds.  Therefore  CG  and  the  TTR  generate  parallel  directions. 
After  applying  the  correction  step  the  TTR  method  produces  the  same  point 
as  the  CG  method.  Now  suppose  that  a cycle  of  TTR  steps  plus  correction 
step  is  completed  and  that  the  matrix  will  be  updated.  Let  s be  the 
last  step  of  the  TTR  method.  Then  s = As  where  s is  the  correspond- 

* I 

ing  CG  step  and  A is  a constant.  Then  y = Cs,  y = ACs,  so  y = - y . 

A 

Looking  at  the  HFGS  update  formula  one  readily  sees  that  it  is  equivalent 
to  use  (s,y)  or  (s  ,y  ).  Therefore  using  the  TTR  method  is  equivalent 
to  using  the  CG  method.  The  result  now  follows  from  Theorem  3.3. 

Similar  results  hold  for  the  other  methods  of  Section  4. 
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APPENDIX 


where 


Also 


Broyden's  ft-class 


Given  xQ  and  n x n matrix  1^,  we  let  k = 1,2,...  . 


Xk+1  " \ " W^k 


\ ^e  step  length  and  is  defined  recursively  by 

H.  - IL  . Vk  T 

Vl  " \ + ~T  + P wkwk  (A-1) 

Wk 


yksk 


_ _ w 

ykVk  " Vk 


wk  T 


P > 0 

Sk  = xk+l  " \ 

yk  " «k+l  ’ «k  when  «k  * Vf  1 * 
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